-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split splines across multiple GPUs on a given node #1101
Conversation
…ite the right final results.
Conflicts: src/QMCWaveFunctions/Jastrow/OneBodyJastrowOrbitalBspline.cpp
…ction). Also restored non-split splines performance back to reference.
Can one of the maintainers verify this patch? |
1 similar comment
Can one of the maintainers verify this patch? |
OK to test |
…riable name typo for QMC_COMPLEX case.
Rhea (gpu) should compile and run fine now. |
In the v3.5, I changed the way of the spline tables collection using in-place gather in order to require no extra scratch memory. MPI is notorious for using 32 bit integer range and you might have an inferior MPI implementation. We need to find a workaround hopefully this is not a bug. |
I'm trying to understand the algorithm. But directly reading the code may not be the best way.
|
To your earlier question: On SummitDev with 4 MPI ranks: buffer->coefs_size=446,772,240 ; ncol = 816 (Working on reducing 1<<20 right now). Algorithm questions:
|
@ye-luo Also, I left your UVM code in place and it should still work even with the split splines... |
@atillack Great work. |
@ye-luo Agreed. |
@ye-luo You are spot-on with your comment on the spline table MPI problem. I went to 1<<16 (to have a lower bound) and that runs through fine with 4 MPI ranks on SummitDev. |
@ye-luo 1<<19 of course works too... (446,772,240 / 816 > (1<<19)) |
@ye-luo To your question of how much faster this implementation is compared to your UVM code: I set the available spline memory to 1738 MB (half of the spline table) so half ends up on the GPU and the other half on the CPU for the 128 atom NiO system. When I do this with split splines using two MPI ranks (this implementation) on SummitDev the DMC block (5 block of 5 steps) takes 149 seconds (within typical fluctuation of the number above). Your UVM code also using two MPI ranks each using the same amount of memory for the spline table on the GPU (and the rest on the host) takes a total of 364 seconds. These runs were on the same node run after another. |
…. 2 MPI ranks on the same GPU, rank 0 initializes half the memory, rank 1 the other half)
Almost finished. |
@prckent I am not sure how to properly set up a test case. In theory, only gpusharing="yes" needs to be set in the section of a test case with enough orbitals (like NiO) but Cuda MPS also needs to be present which is not within the test framework's control (I think). |
Is there a way in the code to test if MPS is enabled or not? If user turns on the flag but the MPS is not ready, the code can turn off this feature or abort the code. |
Keep it simple. Lets worry about MPS settings when we have proven that they are needed. For oxygen, where we run the nightlies, I don't think we need be concerned. Later PRs can improve the situation as needed. |
… Summit). This allows a warning in case CUDA MPS is not available and the user tries to use split splines.
Conflicts: src/einspline/multi_bspline_cuda_d_impl.h src/einspline/multi_bspline_cuda_s_impl.h src/einspline/multi_bspline_cuda_z_impl.h
… as this is safer and is also necessary on Summit.
…ed once per node as contention may skew results.
Currently checking this on oxygen. If it works, I think we should merge. The tests and docs can be updated later. Enough tooth pulling via this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will wait for comments on Friday before merging.
Note: Unable to get this to work with MPS on oxygen. Config instructions at https://docs.nvidia.com/deploy/mps/index.html generate a segv. Probably I don't have the correct/magical combination of cuda aware openmpi and environment settings. Without MPS running the code is appropriately benign. Can someone confirm that this is working? I currently don't have access to a fully updated multiple GPU machine. |
@prckent I can run on SummitDev and Summit. Documentation is updated. Can you send me the output which segfaults? |
The LaTeX is slightly broken. I will push a fix and improvements, then merge. MPS investigations on oxygen will not occur for a while. Having Summit work is the key thing due to upcoming INCITE usage. (Hence #1054 is an important problem.) |
Here is my work-in-progress version of the split splines code for GPU with promising performance numbers:
Currently, only the spline type used for our NiO example is fully implemented. I will of course implement the missing bits once we finalized the code here.
In order to use the code and split the spline data memory across multiple GPUs the following needs to be done:
I did test the original code based on an older development tree on both SummitDev and Summit with 2,4 and 3,6 GPUs, respectively (this is the data above). After updating to the current development tree I could only run on two GPUs on SummitDev to confirm performance levels are still the same - with four GPUs I get an MPI error (coll:ibm:module: datatype_prepare_recvv overflowed integer range triggered by comm->gatherv_in_place from gatherv in the new spline adaptor code). I get the exact same error with vanilla 3.5.0 on SummitDev. Since I can't run on Summit right now I can't verify if this is a regression in 3.5.0 (3.4.0 did not have these issues on Summit) or just the particular MPI version on SummitDev.
To finish: